Automatic Scanner Generation
It is possible to generate a scanner for a programming language automatically. How can this be done? Remembering what you learned from theoretical computer science will help.
Equivalence of Regular Expressions and Finite State Automata
First, recall that the theory shows that finite state automata and regular expressions are equivalent. That is, the following two lemmas have been proved in the theory of computing.
Lemma 1: For each regular expression r, there is a finite state automaton M such that L(M) = L(r).
Proof: By construction (that is, an algorithm is given that converts any instance of a regular expression into a finite state automaton that recognizes exactly the set that the regular expression denotes, or stands for):
                          regular expression to fsa
  regular expression r  ---------------------------->  finite state automaton M
                           conversion algorithm
The implication of this lemma is that finite state automata are at least as powerful as regular expressions, because for each regular expression we can construct a finite state automaton that recognizes the set of strings that the regular expression stands for. One might wonder whether finite state automata can be constructed that recognize languages that cannot be expressed by regular expressions. The next lemma gives the answer.
Lemma 2: For each finite state automaton M there is a regular expression r such that L(r) = L(M).
Proof: By construction (that is, an algorithm is given that converts any instance of a finite state automaton into a regular expression that denotes exactly the set of strings that the finite state automaton recognizes):
                              fsa to regular expression
  finite state automaton M  ---------------------------->  regular expression r
                                conversion algorithm
Together, Lemmas 1 and 2 imply that regular expressions and finite state automata are equivalent in expressive power.
Application of Lemma 1 to Scanning
The great thing about the theory of regular expressions and finite state automata is that it can be applied directly in a very practical way to scanning. For example, look at Lemma 1. Since the theoreticians have already given us the algorithm for converting regular expressions to finite state automata, this means that we can do the following:
- Develop by hand the regular expressions for each token in the programming language we are trying to scan
- Run each regular expression through the given algorithm to get the corresponding finite state automaton automatically.
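The conversion algorithm behind Lemma 1 is usually Thompson's construction: each regular-expression operator becomes a small automaton fragment with one start and one accept state, and fragments are wired together with epsilon moves. The following Java sketch of the idea is an illustration under stated assumptions (it supports only single characters, concatenation, '|', '*', and parentheses), not the code of any particular tool:

```java
import java.util.*;

// Thompson's construction: build an NFA fragment per operator, then simulate
// the resulting NFA by tracking the set of states it could be in.
class ThompsonNfa {
    static final class State {
        char symbol;          // 0 marks an epsilon-only state
        State out1, out2;     // successors (out2 used only by epsilon states)
        boolean accepting;
    }
    static final class Frag { State start, accept; }

    private final String re;
    private int pos;
    private ThompsonNfa(String re) { this.re = re; }

    static Frag compile(String re) {
        Frag f = new ThompsonNfa(re).alternation();
        f.accept.accepting = true;
        return f;
    }

    // alternation := concat ('|' concat)*
    private Frag alternation() {
        Frag f = concat();
        while (pos < re.length() && re.charAt(pos) == '|') {
            pos++;
            Frag g = concat();
            Frag h = new Frag();
            h.start = new State(); h.accept = new State();
            h.start.out1 = f.start; h.start.out2 = g.start;   // branch
            f.accept.out1 = h.accept; g.accept.out1 = h.accept; // rejoin
            f = h;
        }
        return f;
    }

    // concat := starred starred*
    private Frag concat() {
        Frag f = starred();
        while (pos < re.length() && re.charAt(pos) != '|' && re.charAt(pos) != ')') {
            Frag g = starred();
            f.accept.out1 = g.start;          // chain with an epsilon move
            Frag h = new Frag(); h.start = f.start; h.accept = g.accept;
            f = h;
        }
        return f;
    }

    // starred := atom '*'*
    private Frag starred() {
        Frag f = atom();
        while (pos < re.length() && re.charAt(pos) == '*') {
            pos++;
            Frag h = new Frag();
            h.start = new State(); h.accept = new State();
            h.start.out1 = f.start; h.start.out2 = h.accept;   // enter or skip
            f.accept.out1 = f.start; f.accept.out2 = h.accept; // loop or leave
            f = h;
        }
        return f;
    }

    // atom := '(' alternation ')' | character
    private Frag atom() {
        char c = re.charAt(pos++);
        if (c == '(') {
            Frag f = alternation();
            pos++;                            // consume ')'
            return f;
        }
        Frag h = new Frag();
        h.start = new State(); h.accept = new State();
        h.start.symbol = c; h.start.out1 = h.accept;
        return h;
    }

    // Simulate the NFA: keep the set of possible states, closed under epsilon.
    static boolean matches(String re, String input) {
        Set<State> current = closure(new HashSet<>(List.of(compile(re).start)));
        for (char c : input.toCharArray()) {
            Set<State> next = new HashSet<>();
            for (State s : current)
                if (s.symbol == c) next.add(s.out1);
            current = closure(next);
        }
        for (State s : current) if (s.accepting) return true;
        return false;
    }

    private static Set<State> closure(Set<State> states) {
        Deque<State> work = new ArrayDeque<>(states);
        Set<State> seen = new HashSet<>(states);
        while (!work.isEmpty()) {
            State s = work.pop();
            if (s.symbol == 0)                // epsilon state: follow both edges
                for (State t : new State[]{s.out1, s.out2})
                    if (t != null && seen.add(t)) work.push(t);
        }
        return seen;
    }
}
```

Note that the simulation tracks a *set* of states, which is exactly the nondeterminism discussed next.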
There is one slight problem, though. The output of the regular expression to fsa conversion algorithm is usually a nondeterministic finite state automaton (nfsa). Fortunately, more theory shows us how to convert any nfsa to an equivalent dfsa.
Equivalence of Deterministic and Nondeterministic FSAs
Lemma 3. For each nfsa M there is an equivalent dfsa M' such that L(M) = L(M').
Proof: By construction.
                nfsa to dfsa
  nfsa M  ------------------------>  dfsa M'
             conversion algorithm
Again, in this case the theoreticians have not only given us a theorem that states that deterministic and nondeterministic finite state automata are equivalent, but as part of proving it they have given us an actual algorithm for converting any nfsa into an equivalent dfsa.
Application of Lemma 3 to Scanning
We can now use Lemma 3 to further our objective of building a scanner.
- Take the outputs of the previous algorithm, which are nondeterministic finite state automata that recognize the tokens of our programming language.
- Run each of these nondeterministic finite state automata through the algorithm of Lemma 3 to obtain deterministic finite state automata that recognize the same tokens.
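The algorithm of Lemma 3 is the subset construction: each state of the resulting dfsa is a *set* of nfsa states, and one dfsa move means "move every nfsa state in the set on the input symbol, then take the epsilon closure." A minimal Java sketch follows; the encoding of the nfsa as integer states and transition tables is an assumption made for this example:

```java
import java.util.*;

// Subset construction: dfsa states are sets of nfsa states.
class SubsetConstruction {
    static final char EPS = '\0';               // marker for epsilon edges
    // nfsa transitions: state -> symbol -> set of successor states
    final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
    final Set<Integer> accepting = new HashSet<>();

    void edge(int from, char sym, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(sym, k -> new HashSet<>()).add(to);
    }

    // epsilon closure: everything reachable from `states` via EPS edges
    Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> seen = new HashSet<>(states);
        while (!work.isEmpty())
            for (int t : delta.getOrDefault(work.pop(), Map.of())
                              .getOrDefault(EPS, Set.of()))
                if (seen.add(t)) work.push(t);
        return seen;
    }

    // one dfsa move: step every nfsa state in the set on sym, then close
    Set<Integer> step(Set<Integer> from, char sym) {
        Set<Integer> next = new HashSet<>();
        for (int s : from)
            next.addAll(delta.getOrDefault(s, Map.of()).getOrDefault(sym, Set.of()));
        return closure(next);
    }

    // materialize the dfsa: explore every reachable set of nfsa states
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(closure(Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> q = work.pop();
            if (dfa.containsKey(q)) continue;
            Map<Character, Set<Integer>> row = new HashMap<>();
            dfa.put(q, row);
            for (char c : alphabet) {
                Set<Integer> t = step(q, c);
                row.put(c, t);
                if (!dfa.containsKey(t)) work.push(t);
            }
        }
        return dfa;
    }

    // run the dfsa (computing its states on the fly) on an input string
    boolean accepts(int start, String input) {
        Set<Integer> current = closure(Set.of(start));
        for (char c : input.toCharArray()) current = step(current, c);
        for (int s : current) if (accepting.contains(s)) return true;
        return false;
    }
}
```

For example, an nfsa for (a|b)*ab with states 0..3 (an epsilon edge from 0 to 1, loops on 'a' and 'b' at 1, a guess 1 -a-> 2, and 2 -b-> 3 accepting) yields a four-state dfsa under this construction.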
One minor problem remains. The deterministic finite state automata that result from the algorithm of Lemma 3 may have more states than necessary. Once again, theory comes to our rescue.
The Existence of Minimal-State Deterministic FSAs
A final lemma from the theory of computing shows that we can find the minimal-state deterministic finite state automaton from any given deterministic finite state automaton.
Lemma 4. For each deterministic finite state automaton M there is a minimal-state deterministic finite state automaton M' with L(M') = L(M).
Proof: By construction. Once again, the theoreticians have not only proved the existence of a minimal-state finite state automaton, but they have also given us the algorithm for making such an fsa.
Application of Lemma 4 to Scanning
Following the same tack as above, we feed the deterministic finite state automata produced by the algorithm of Lemma 3 into the algorithm of Lemma 4, obtaining the minimal-state deterministic finite state automaton for each token that we need to scan.
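One standard algorithm behind Lemma 4 is minimization by partition refinement (Moore's algorithm): start from the accepting / non-accepting split of the states, and keep splitting any block whose states move to different blocks, until the partition is stable. The blocks of the final partition are the states of the minimal dfsa. The Java sketch below counts them; the table encoding of the dfsa is an assumption made for the example:

```java
import java.util.*;

// dfsa minimization by partition refinement (Moore's algorithm).
class Minimize {
    // delta[s][c] = successor of state s on symbol index c; returns the
    // number of states of the equivalent minimal dfsa
    static int countMinStates(int[][] delta, boolean[] accepting) {
        int n = delta.length;
        int[] block = new int[n];                    // block id of each state
        Set<Integer> init = new HashSet<>();
        for (int s = 0; s < n; s++) init.add(block[s] = accepting[s] ? 1 : 0);
        int numBlocks = init.size();
        while (true) {
            // signature of a state: its own block plus its successors' blocks;
            // states with equal signatures stay together in the refined partition
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(block[s]);
                for (int t : delta[s]) sig.add(block[t]);
                next[s] = ids.computeIfAbsent(sig, k -> ids.size());
            }
            if (ids.size() == numBlocks) return numBlocks;  // stable: done
            numBlocks = ids.size();
            block = next;
        }
    }
}
```

For instance, a deliberately redundant four-state dfsa for "strings over {a, b} ending in b" collapses to the two states that are actually needed.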
Tying it all together
Being able to automatically produce the minimal-state deterministic finite state automaton for each token in our programming language still seems a long way from producing a practical scanner. What we want is:
                                  Automatic Scanner
  formal definition of tokens  ---------------------->  Scanner in Java
                                      Generator
In other words, we want to be able to write down all of the regular expressions for the tokens in the programming language we are trying to scan and have a program produce an actual scanner (say, in Java). As usual, when practice meets theory, we have some additional work to do. The algorithms above can generate the minimal-state finite state automata needed for scanning, but what the scanner generator can't do without some help from us is decide what to do when it finds a token. So, the formal definition of tokens given as input to the scanner generator must have a special form that gives not only the regular expressions for the tokens but also directions on what to do when each token is scanned. Input to a scanner generator might look like the following:
  ** Part 1, Auxiliary Expressions

  Digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  Letter ::= a | A | b | B | c | C ... | z | Z
  . . .

  ** Part 2, Tokens

  '('                              return(l_paren)
  ')'                              return(r_paren)
  ':='                             return(assign_op)
  letter {letter | digit}          return(id)
  digit {digit}                    return(integer_literal)
  "'" {^("'" | "''" | eol)} "'"    Strip_Quotes; return(string_literal)
  . . .

  ** Part 3, Special Procedures

  procedure Strip_Quotes is
  begin -- Strip_Quotes
     -- Strip the beginning and ending quotes
     -- and replace all double apostrophes in
     -- the string with a single apostrophe.
     -- Be sure to build up the lexeme in the
     -- process.
  end Strip_Quotes;
Note that the automatically generated scanner keeps track of the lexeme scanned, and perhaps other information such as the line and column numbers of the start of the lexeme. These may be kept in public attributes that can be accessed by whatever program uses the scanner. They are different from the actions shown above, such as Strip_Quotes, which specify additional work to be done when a string is successfully scanned.
Here we have given all of the items necessary for a scanner to be generated. You already know a standard format for building a scanner given all of the fsa's for the tokens in the language. All you have to do is add the parts to finish the scanner. For example, after you insert the fsa for an id in its proper place in the code, you simply add the line "return(id)" to the code. (The use of "return(id)" reflects the fact that early scanner generators produced scanners in C. If the scanner you are generating is written in an object-oriented programming language, such as Java, you would likely use a public attribute token and insert the line "token = id;" instead of "return(id)" into your produced scanner.) After the fsa for string literals, you add the code to call the procedure Strip_Quotes (and you include procedure Strip_Quotes at the top of your program), followed by "return(string_literal)", and so forth. You must also make publicly available values for lexeme, line_number, column_number, and error_string.
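The shape of such a generated Java scanner can be sketched as follows. This is a hypothetical hand-written illustration, not the output of any real generator: the class and method names are assumptions, only the id and integer_literal dfsa's are included, and the lexeme and the line/column of its start are exposed as public attributes as described above, with "token = id;" standing in for C's "return(id)".

```java
// Hypothetical sketch of the kind of scanner a generator might emit.
class GenScanner {
    public enum Token { ID, INTEGER_LITERAL, ERROR, EOF }
    public Token token;                    // set instead of "return(...)"
    public String lexeme;                  // text of the token just scanned
    public int lineNumber, columnNumber;   // position of the lexeme's start

    private final String input;
    private int pos = 0, curLine = 1, curCol = 1;

    public GenScanner(String input) { this.input = input; }

    public void nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos)))
            advance();
        lineNumber = curLine;              // remember where the lexeme starts
        columnNumber = curCol;
        if (pos >= input.length()) { token = Token.EOF; lexeme = ""; return; }
        StringBuilder lex = new StringBuilder();
        char c = input.charAt(pos);
        if (Character.isLetter(c)) {
            // dfsa for id: letter {letter | digit}
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) {
                lex.append(input.charAt(pos)); advance();
            }
            token = Token.ID;              // "token = id;" in place of return(id)
        } else if (Character.isDigit(c)) {
            // dfsa for integer_literal: digit {digit}
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                lex.append(input.charAt(pos)); advance();
            }
            token = Token.INTEGER_LITERAL;
        } else {
            lex.append(c); advance();
            token = Token.ERROR;
        }
        lexeme = lex.toString();
    }

    // consume one character, maintaining the line and column counters
    private void advance() {
        if (input.charAt(pos) == '\n') { curLine++; curCol = 1; }
        else curCol++;
        pos++;
    }
}
```

A caller repeatedly invokes nextToken() and reads the public attributes until token is EOF.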